Purpose

The purpose of this notebook is to compare amino acid pKa values and binarized (one-hot) sequence features as predictors of HIV protease drug resistance.

  • Sequence Features Tested:
    • Amino acid pKa values
    • Binarized amino acid identities (one-hot encoded)
  • Model:
    • Random Forest Regressor

In [41]:
# Jupyter notebook magic statements
%matplotlib inline
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

In [42]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set_style('white')

from Bio import SeqIO
from util.isoelectric_point import *  # imported from the 'util/' directory
from sklearn.cross_validation import cross_val_score, ShuffleSplit, train_test_split  # pre-0.18 API; later moved to sklearn.model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as mse
from sklearn.preprocessing import LabelBinarizer

In [43]:
# List the files present in the "data" folder
! ls data/


20150902_nnet_ha.fasta                hiv-nnrt-data.csv                     hiv-protease-data-expanded.csv        hiv-protease.pdb
HIV-1_pol.fasta                       hiv-nrt-data.csv                      hiv-protease-data.csv                 hiv-rt-consensus.fasta
amino_acid_properties.csv             hiv-protease-consensus.fasta          hiv-protease-sequences-expanded.fasta

In [44]:
# Load the protease resistance data into memory.
# [ ] TODO: Refactor this function out as a load_data() function.
resistance = pd.read_csv('data/hiv-protease-data-expanded.csv', index_col=0)
drug_cols = ['NFV', 'SQV', 'FPV', 'IDV', 'LPV', 'ATV', 'TPV', 'DRV']
resistance.set_index('seqid', inplace=True)
resistance[drug_cols] = resistance[drug_cols].apply(np.log10)
resistance.head()


Out[44]:
ATV DRV FPV IDV LPV NFV SQV SeqID TPV sequence sequence_object weight
seqid
2996-0 NaN NaN 0.397940 1.212188 NaN 1.586587 1.206826 2996 NaN PQITLWQRPIVTIKIGGQLKEALLDTGADDTVLEDVNLPGRWKPKM... ID: 2996-0\nName: <unknown name>\nDescription:... 0.50
2996-1 NaN NaN 0.397940 1.212188 NaN 1.586587 1.206826 2996 NaN PQITLWQRPIVTIKIGGQLKEALLDTGADDTVLEDVNLPGRWKPKM... ID: 2996-1\nName: <unknown name>\nDescription:... 0.50
4387-0 NaN NaN -0.154902 -0.096910 NaN -0.096910 0.041393 4387 NaN PQITLWQRPLVTIKVGGQLKEALLDTGADDTVLEDMELPGRWKPKM... ID: 4387-0\nName: <unknown name>\nDescription:... 0.25
4387-1 NaN NaN -0.154902 -0.096910 NaN -0.096910 0.041393 4387 NaN PQITLWQRPLVTIKVGGQLKEALLDTGADDTVLEDMELPGRWKPKM... ID: 4387-1\nName: <unknown name>\nDescription:... 0.25
4387-2 NaN NaN -0.154902 -0.096910 NaN -0.096910 0.041393 4387 NaN PQITLWQRPLVTIKVGGQLKEALLDTGADDTVLEDMELPGRWKPKM... ID: 4387-2\nName: <unknown name>\nDescription:... 0.25
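The loading steps above can be pulled out into the `load_data()` helper the TODO mentions. A minimal sketch (function name taken from the TODO; assumes the same CSV layout as `hiv-protease-data-expanded.csv`):

```python
import numpy as np
import pandas as pd

DRUG_COLS = ['NFV', 'SQV', 'FPV', 'IDV', 'LPV', 'ATV', 'TPV', 'DRV']

def load_data(path='data/hiv-protease-data-expanded.csv', drug_cols=DRUG_COLS):
    """Read the resistance CSV, index it by seqid, and log10-transform
    the drug fold-resistance columns."""
    df = pd.read_csv(path, index_col=0)
    df.set_index('seqid', inplace=True)
    df[drug_cols] = df[drug_cols].apply(np.log10)
    return df
```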

In [45]:
# Count the number of NaNs in each of the drug resistance columns.
pd.isnull(resistance[drug_cols]).sum().sort_values()


Out[45]:
NFV     111
SQV     316
FPV     319
IDV     325
LPV     930
ATV    1928
TPV    2650
DRV    4141
dtype: int64

In [46]:
# Convert the FASTA file of sequences into a pandas dataframe.
# [ ] TODO: Refactor out into a load_sequences() function.
sequences = [np.array(s) for s in SeqIO.parse('data/hiv-protease-sequences-expanded.fasta', 'fasta')]
seqids = [s.id for s in SeqIO.parse('data/hiv-protease-sequences-expanded.fasta', 'fasta')]
sequences = pd.DataFrame(sequences)
sequences = sequences.replace('*', np.nan)
sequences.index = seqids
sequences.dropna(inplace=True)

# Because there's a lower-case character somewhere, force all of the characters to be upper-case.
for col in sequences.columns:
    sequences[col] = sequences[col].apply(lambda x: str(x).upper())
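The TODO above can be sketched as a `load_sequences()` helper (name taken from the TODO). To keep the sketch self-contained it uses a minimal plain-text FASTA reader in place of `SeqIO.parse`; the cleaning steps mirror the cell above:

```python
import numpy as np
import pandas as pd

def load_sequences(path):
    """Parse a FASTA of aligned sequences into a positions-as-columns DataFrame,
    dropping sequences containing stop codons ('*') and upper-casing letters."""
    ids, seqs = [], []
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith('>'):
                if header is not None:
                    ids.append(header)
                    seqs.append(''.join(chunks))
                header, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
        if header is not None:
            ids.append(header)
            seqs.append(''.join(chunks))
    df = pd.DataFrame([list(s) for s in seqs], index=ids)
    df = df.replace('*', np.nan).dropna()          # drop sequences with stop codons
    return df.apply(lambda col: col.str.upper())   # normalize stray lower-case letters
```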

In [47]:
len(sequences)


Out[47]:
6900

In [48]:
# Commands to inspect the utility scripts
# ! ls util/
! cat util/isoelectric_point.py


"""
Author: Eric J. Ma
Date: 2015-03-17
"""

isoelectric_points = {
    'A':6.11,
    'R':10.76,
    'N':10.76,
    'D':2.98,
    'C':5.02,
    'E':3.08,
    'Q':5.65,
    'G':6.06,
    'H':7.64,
    'I':6.04,
    'L':6.04,
    'K':9.74,
    'M':5.74,
    'F':5.91,
    'P':6.30,
    'S':5.68,
    'T':5.60,
    'W':5.88,
    'Y':5.63,
    'V':6.02,
    'X':7.00,  # unknown so assign neutral
    'B':6.87,  # the average of D and N
    'Z':4.35,  # the average of E and Q
    'J':6.04,  # the average of I and L
}

def get_isoelectric_point(aa):
    return isoelectric_points[aa]
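As a quick sanity check, the lookup can be applied per-residue to a peptide; this is what the next cell does in vectorized form (table excerpt reproduced from the listing above):

```python
# Excerpt of the isoelectric_points table from util/isoelectric_point.py.
isoelectric_points = {'P': 6.30, 'Q': 5.65, 'I': 6.04, 'T': 5.60,
                      'D': 2.98, 'R': 10.76}

def get_isoelectric_point(aa):
    return isoelectric_points[aa]

# Map a short peptide (the protease N-terminus) to per-position pKa features.
peptide = 'PQIT'
features = [get_isoelectric_point(aa) for aa in peptide]
print(features)  # [6.3, 5.65, 6.04, 5.6]
```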

In [49]:
# Replace each amino acid letter with its pKa (isoelectric point) value.
# Stop codons ("*") were already replaced with NaN and dropped above.
pKa_features = sequences.replace(isoelectric_points)
pKa_features.head()


Out[49]:
0 1 2 3 4 5 6 7 8 9 ... 89 90 91 92 93 94 95 96 97 98
2996-0 6.3 5.65 6.04 5.6 6.04 5.88 5.65 10.76 6.3 6.04 ... 5.74 5.6 5.65 6.04 6.06 5.02 5.6 6.04 10.76 5.91
2996-1 6.3 5.65 6.04 5.6 6.04 5.88 5.65 10.76 6.3 6.04 ... 5.74 5.6 5.65 6.04 6.06 5.02 5.6 6.04 10.76 5.91
4387-0 6.3 5.65 6.04 5.6 6.04 5.88 5.65 10.76 6.3 6.04 ... 6.04 5.6 5.65 6.04 6.06 5.02 5.6 6.04 10.76 5.91
4387-1 6.3 5.65 6.04 5.6 6.04 5.88 5.65 10.76 6.3 6.04 ... 6.04 5.6 5.65 6.04 6.06 5.02 5.6 6.04 10.76 5.91
4387-2 6.3 5.65 6.04 5.6 6.04 5.88 5.65 10.76 6.3 6.04 ... 6.04 5.6 5.65 6.04 6.06 5.02 5.6 6.04 10.76 5.91

5 rows × 99 columns


In [72]:
# Join in the resistance data for the chosen drug (here LPV)

# [ ] TODO: Refactor lines 5 through 14 into a get_XY_matrix() function.
# Tunable parameters
X_cols = [i for i in range(0,99)]
drug = 'LPV'
weight_col = 'weight'

# Get the final data matrix, and split into X and Y
data_matrix = pKa_features.join(resistance[[drug, weight_col]], how='inner').dropna()
print(len(data_matrix))
X = data_matrix[X_cols]
Y = data_matrix[drug]
W = data_matrix[weight_col]

# Perform a train/test split
cv = ShuffleSplit(n=len(X), n_iter=10, test_size=0.2)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Define the model
mdl = RandomForestRegressor(n_estimators=100)

# Score by mean absolute error (sklearn returns negated scores, hence the abs below)
scores = cross_val_score(mdl, X, Y, cv=cv, scoring='mean_absolute_error', n_jobs=-1)
np.abs(np.mean(scores))


5970
Out[72]:
0.070275725049439255
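The join-and-split steps above can be refactored into the `get_XY_matrix()` function the TODO asks for. A sketch (function name taken from the TODO; assumes the same column layout):

```python
import pandas as pd

def get_XY_matrix(features, resistance, drug, weight_col='weight'):
    """Join per-position features with one drug's resistance values,
    drop rows with missing measurements, and split into X, Y, and weights."""
    data = features.join(resistance[[drug, weight_col]], how='inner').dropna()
    X = data[list(features.columns)]
    Y = data[drug]
    W = data[weight_col]
    return X, Y, W
```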

In [73]:
np.std(scores)


Out[73]:
0.0041089622602374781

In [74]:
# Do a fit/predict on the X and Y matrices.
def train_random_forest_regressor(X_train, Y_train, W, X_test, Y_test):
    mdl = RandomForestRegressor(n_estimators=100)
    mdl.fit(X_train, Y_train, sample_weight=W.loc[X_train.index].values)
    preds = mdl.predict(X_test)
    score = mdl.score(X_test, Y_test)
    
    return preds, score

preds, score = train_random_forest_regressor(X_train, Y_train, W, X_test, Y_test)

In [75]:
# Plot the predicted against actual.

def plot_test_vs_preds(Y_test, preds):
    # Note: relies on the module-level `score` and `drug` variables.
    fig = plt.figure(figsize=(4, 4))
    plt.scatter(Y_test, preds)
    plt.xlabel('Actual')
    plt.ylabel('Predicted')
    # y = x reference line: perfect predictions fall on it.
    plt.plot([min(Y_test), max(Y_test)], [min(Y_test), max(Y_test)], color='r', lw=3)
    plt.annotate('r2: {0:.02f}'.format(score), xy=(2, 0))
    plt.title('{0}, mse: {1:.03f}'.format(drug, mse(Y_test, preds)))
    
plot_test_vs_preds(Y_test, preds)



In [76]:
# The MSE (from the plot above) is ~0.02 and the cross-validated mean
# absolute error is ~0.07, both in log10-transformed units.
# What is the mean error in non-log-transformed (fold-change) units?
print('The actual mean error is:')
np.power(10, np.abs(np.mean(scores)))


The actual mean error is:
Out[76]:
1.1756437111907017
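Because the targets are log10 fold-resistance, a mean absolute error of m in log space corresponds to a typical multiplicative error of 10^m. With the ~0.070 MAE from above:

```python
mae_log10 = 0.0703  # cross-validated MAE from the cells above, in log10 units
fold_error = 10 ** mae_log10
print(round(fold_error, 3))  # predictions are typically within ~1.18x of the true value
```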

What about binarized protein sequence?


In [77]:
# We will use the scikit-learn LabelBinarizer to one-hot encode the amino acids into a letter matrix.
aa_letters = list(isoelectric_points.keys())

lb = LabelBinarizer()
lb.fit(aa_letters)

bin_features = pd.DataFrame()
X_cols = []

for col in sequences.columns:
    transformed = lb.transform(sequences[col])
    
    for i, letter in enumerate(lb.classes_):
        colname = '{0}_{1}'.format(col, letter)
        bin_features[colname] = transformed[:, i]
        X_cols.append(colname)

bin_features.index = sequences.index
bin_features.head()


Out[77]:
0_A 0_B 0_C 0_D 0_E 0_F 0_G 0_H 0_I 0_J ... 98_P 98_Q 98_R 98_S 98_T 98_V 98_W 98_X 98_Y 98_Z
2996-0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2996-1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4387-0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4387-1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4387-2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 2376 columns
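For reference, the LabelBinarizer loop can also be written with pandas `get_dummies` (a sketch; `binarize_sequences` is a hypothetical helper name). Fixing the categories to the full alphabet guarantees that all 24 letter columns appear at every position, and the `'{pos}_{letter}'` column naming matches the cell above:

```python
import pandas as pd

def binarize_sequences(sequences, alphabet):
    """One-hot encode a positions-as-columns sequence DataFrame."""
    frames = []
    for col in sequences.columns:
        # Fix the category set so every letter gets a column, even if absent.
        cat = pd.Categorical(sequences[col], categories=sorted(alphabet))
        d = pd.get_dummies(cat).astype(int)
        d.columns = ['{0}_{1}'.format(col, letter) for letter in d.columns]
        d.index = sequences.index
        frames.append(d)
    return pd.concat(frames, axis=1)
```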


In [78]:
data_matrix = bin_features.join(resistance[[drug, weight_col]]).dropna()
print(len(data_matrix))

X = data_matrix[X_cols]
Y = data_matrix[drug]
W = data_matrix[weight_col]

# Perform a train/test split
cv = ShuffleSplit(n=len(X), n_iter=10, test_size=0.2)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

# Define the model
mdl = RandomForestRegressor(n_estimators=100)

# Score by mean absolute error (sklearn returns negated scores, hence the abs below)
scores = cross_val_score(mdl, X, Y, cv=cv, scoring='mean_absolute_error', n_jobs=-1)
np.abs(np.mean(scores))


5970
Out[78]:
0.066340048817844915

In [79]:
np.std(scores)


Out[79]:
0.0043725869577109708

In [80]:
# Do a fit/predict on the X and Y matrices.
preds, score = train_random_forest_regressor(X_train, Y_train, W, X_test, Y_test)

In [81]:
# Plot the predicted against actual.
plot_test_vs_preds(Y_test, preds)



In [82]:
print('The actual mean error is:')
np.power(10, np.abs(np.mean(scores)))


The actual mean error is:
Out[82]:
1.1650378869505527

Conclusions

Binarized sequence features and pKa values perform essentially the same: cross-validated MAE of ~0.066 vs. ~0.070 log10 units, i.e. a typical fold-error of ~1.17x vs. ~1.18x.

However, we lack interpretability. I would like to implement something like a convolutional network on graph representations of chemical structures.

